Hierarchical Clustering Using Mutual Information 
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We present a method for hierarchical clustering of data called mutual information clustering 
(MIC) algorithm. It uses mutual information (MI) as a similarity measure and exploits its grouping 
property: The MI between three objects X, Y, and Z is equal to the sum of the MI between X 
and Y, plus the MI between Z and the combined object (XY). We use this both in the Shannon 
(probabilistic) version of information theory and in the Kolmogorov (algorithmic) version. We apply 
our method to the construction of phylogenetic trees from mitochondrial DNA sequences and to the 
output of independent components analysis (ICA) as illustrated with the ECG of a pregnant woman. 



Classification or organizing of data is very important 
in all scientific disciplines and is fundamental for under- 
standing and learning [jj . Classification can be exclusive 
or overlapping, supervised or unsupervised. In the follow- 
ing we will be interested only in exclusive unsupervised 
classification, called clustering. 

An instance of a clustering problem consist of a set of 
objects and a set of properties (called characteristic vec- 
tor) for each object. The goal of clustering is separation 
of objects into groups using only the characteristic vec- 
tors. Cluster analysis organizes data either as a single 
grouping of individuals into non-overlapping clusters or 
as a hierarchy of nested partitions. The latter is called 
hierarchical clustering (HC). Because of wide spread of 
applications, there are a large variety of different cluster- 
ing methods in usage, see e.g. for an overview. 

The crucial point of all clustering algorithms is the 
choice of a proximity measure. This is obtained from the 
characteristic vectors and can be either an indicator for 
similarity or dissimilarity. In the latter case it is conve- 
nient but not obligatory to satisfy the standard axioms 
of a metric (positivity, symmetry, and triangle inequal- 
ity). Among HC methods one should distinguish between 
those where one uses the characteristic vectors only at the 
first level of the hierarchy and derives the proximities be- 
tween clusters from the proximities of their constituents, 
and methods where the proximities are calculated each 
time from their characteristic vectors. The latter strat- 
egy (which is used also in the present paper) allows of 
course for more flexibility but might also be computa- 
tionally more costly. 

Quite generally, the "objects" to be clustered can be 
either single (finite) patterns (e.g. DNA sequences) or 
random variables, i.e. probability distributions. In the 
latter case the data are usually supplied in form of a sta- 
tistical sample, and one of the simplest and most widely 
used similarity measures is the linear (Pearson) corre- 
lation coefficient. But this is not sensitive to nonlinear 
dependencies which do not manifest themselves in the 
covariance and can thus miss important features. This 
is in contrast to mutual information (MI) which is also 
singled out by its information theoretic background 0. 
Indeed, MI is zero only if the two random variables are 



strictly independent. 

Another important feature of MI is that it has also 
an "algorithmic" cousin, defined within algorithmic (Kol- 
mogorov) information theory Q which measures the sim- 
ilarity between individual objects. For a thorough discus- 
sion of distance measures based on algorithmic MI and 
for their application to clustering, see jj, @ . 

Essential for the present application is the grouping 
property of MI, 



I(X,Y,Z) = I(X,Y) + I((X,Y),Z). 



(1) 



Within Shannon information theory this is an exact the- 
orem, while it is true in the algorithmic version up to 
the usual logarithmic correction terms Q ■ Since X, Y, 
and Z can be themselves composite, Eq.Q can be used 
recursively for a cluster decomposition of MI. This mo- 
tivates the main idea of our clustering method: instead 
of using e.g. centers of masses in order to treat clusters 
like individual objects in an approximative way only, we 
treat them exactly like individual objects when using MI 
as proximity measure and We thus propose the following 
scheme for clustering n objects with MIC: 

(1) Compute a proximity matrix based on pairwise mu- 
tual informations; assign n clusters such that each cluster 
contains exactly one object; 

(2) find the two closest clusters % and j; 

(3) create a new cluster (ij) by combining i and j; 

(4) delete the lines/columns with indices i and j from the 
proximity matrix, and add one line/column containing 
the proximities between cluster (ij) and all other clus- 
ters; 

(5) if the number of clusters is still > 2, goto (2); else 
join the two clusters and stop. 

Shannon Theory: Here, X = X\,Y = Xj, . . . are 
random variables. If they are discrete, entropies are de- 
fined as usual H(X) = —J2 i pi(X) log pi(X) etc. The 
MI is defined as 

n 

7(Xi, ...,X n ) = J2 H ( x k) - H(X 1: X n ). (2) 
fe=i 

Eq.JD can be checked easily, together with its general- 
ization to arbitrary groupings. It means that MI can be 



2 



decomposed into hierarchical levels. By iterating it, one 
can decompose I{X\ . . . X n ) for any n > 2 and for any 
partitioning of the set (X\ . . . X n ) into the Mis between 
elements within one cluster and Mis between clusters. 

For continuous variables with densities fix etc., one 
first introduces some binning ('coarse-graining'), and ap- 
plies the above to the binned variables. If a; is a vector 
with dimension m and each bin has Lebesgue measure 
A, then Pi(X) « /ix(x)A m with x chosen suitably in bin 
i, and 



H hin (X)^H(X)-mlogA 
where the differential entropy is given by 



H{X) 



dx fix (x) log nx{x) 



(3) 



(4) 



Notice that H\>i n (X) is a true (average) information and 
is thus non-negative, but H(X) is not an information, can 
be negative, and is not invariant under homeomorphisms 
x — > 4>(x). 

Joint entropies, conditional entropies, and MI are de- 
fined as above, with sums replaced by integrals. Like 
H(X), joint and conditional entropies are neither posi- 
tive (semi-)definite nor invariant. But MI, defined as 



I(X,Y) 



dxdy nxY{x 7 y) log 



lixy(x,y) 
Hx(x)n Y (y) 



(5) 



is non-negative and invariant under x —>■ 4>(x) and y — » 
ijj(y). It is (the limit of) a true information, 

I(X, Y) = Urn [H hia (X) + H hin (Y) - H hin (X, Y)}. (6) 

In applications, one usually has the data available in 
form of iV sample points (xi, y,), i = 1, . . . N which are 
assumed to be i.i.d. realizations. There exist numerous 
algorithms to estimate I(X, Y) and entropies. We use 
in the following the MI estimators proposed recently in 
Ref. [||, and we refer to this paper for a review of alter- 
native methods. 

Algorithmic Information Theory: In contrast to 
Shannon theory where the basic objects are random vari- 
ables and entropies are average informations, algorithmic 
information theory deals with individual symbol strings 
and with the actual information needed to specify them. 
To "specify" a sequence X means here to give the neces- 
sary input to a universal computer [/, such that U prints 
X on its output and stops. The analogon to entropy, 
called here usually the complexity K(X) of X, is the 
minimal length of an input which leads to the output 
X, for fixed U. It depends on U, but it can be shown 
that this dependence is weak and can be neglected in the 
limit when K(X) is large Q. 

Let us denote the concatenation of two strings X and 
Y as XY. Its complexity is K(XY). It is intuitively clear 
that K(XY) should be larger than K (X) but cannot be 



larger than the sum K(X) + K(Y). Finally, one expects 
that K(X\Y), defined as the minimal length of a program 
printing X when Y is furnished as auxiliary input, is 
related to K(XY) — K(Y). Indeed, one can show Q 
(again within correction terms which become irrelevant 
asymptotically) that 



< K(X\Y) ~ K{XY) - K{Y) < K(X). 



(7) 



Notice the close similarity with Shannon entropy. The 
algorithmic information in Y about X is finally 

hi s (X,Y) = K(X)-K{X\Y) ~ K(X)+K{Y)-K(XY), 

(8) 

and similarly for more than two strings. Within the same 
additive correction terms, one shows that it is symmet- 
ric, I a \ g (X,Y) = I a \ g (Y,X), and can thus serve as an 
analogon to mutual information. 

K [X) is in general not computable. But one can easily 
give upper bounds: The length of any input which pro- 
duces X (e.g. by spelling it out verbatim) is an upper 
bound. Improved upper bounds are provided by any file 
compression algorithm. 

Mi-Based Distance Measures: When comparing 
objects with different marginal or joint informations, it 
seems intuitively clear that one should prefer relative dis- 
tances over absolute ones, in order to minimize the de- 
pendence on the total information. We here use the quan- 
tity 



D(X,Y) = 1 



H(X,Y) 



(9) 



which is a metric , with D(X,X) = and D(X,Y) < 
1 for all pairs (X, Y) . The algorithmic version is also 
universal: If X ps Y according to any non-trivial distance 
measure, then X sa Y also according to D. 

A difficulty appears in the Shannon framework, if we 
deal with continuous random variables. As we men- 
tioned above, H(X, Y) is not invariant under homeomor- 
phisms (including rescalings) and not even positive defi- 
nite, while -ffbin diverges when A — » 0. We thus modified 
Eq.@ by replacing H(X,Y) by H hm (X,Y) and replac- 
ing D(X, Y) by the similarity measure 



S{X,Y) = fim o (D(A,r) - l)logA 



I(X,Y) 



(10) 



A Phylogenetic Tree for Mammals: We study 
the mitochondrial DNA of a group of 34 mammals (see 
Fig. 1). The same data || had previously been analyzed 
m mill This group includes among others some rodents, 
ferungulates, and primates. 

Obviously we are here dealing with the algorithmic 
version of information theory, and informations are es- 
timated by lossless data compression. For constructing 
the proximity matrix between individual taxa, we pro- 
ceed essentially as in Ref. 13 , us ing the special compres- 
sion program GenCompress [lfj. 
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FIG. 1: Phylogenetic tree for 34 mammals. The heights of 
nodes are the distances between the joining daughter clusters. 

In Ref. 1^1 , this proximity matrix was then used as the 
input to a standard HC algorithm (neighbour-joining and 
hypercleaning) to produce an evolutionary tree. Instead 
we use the MIC algorithm with distance D(X,Y). The 
joining of two clusters is obtained by simply concate- 
nating the DNA sequences. There is of course an arbi- 
trariness in the order of concatenation sequences: XY 
and YX give in general compressed sequences of differ- 
ent lengths. But we found this to have negligible effect 
on the evolutionary tree. 

The overall structure of this tree closely resembles the 
one shown in Ref. . All primates are correctly clustered 
and also the relative order of the ferungulates (blue whale 
to horse) is in accordance with Ref. On the other 
hand, there are a number of connections which obviously 
do not reflect the true evolutionary tree, see for example 
the guinea pig with bat and elephant with platypus. But 
the latter two, inspite of being joined together, have a 
very large distance from each other, thus their clustering 
just reflects the fact that neither the platypus nor the 
elephant have other close relatives in the sample. All in 
all, however, already the results shown in Fig. 1 capture 
surprisingly well the overall structure shown in Ref. 0. 
Dividing MI by the total information is essential for this 
success. If we had used the non- normalized / a i g (X, Y) it- 
self, the clustering algorithm used in Q would not change 
much, since all 34 DNA sequences have roughly the same 
length. But our MIC algorithm would be completely 
screwed up: After the first cluster formation, we have 
DNA sequences of very different lengths, and longer se- 
quences tend also to have larger MI, even if they are not 
closely related. 

The concatenation of X and Y will of course not lead 
to a plausible sequence of the common ancestor, but it 



optimally represents the information about it. This in- 
formation is essential to find the correct way through 
higher hierarchy levels of the evolutionary tree, and it is 
preserved in concatenating. 

Clustering of Minimally Dependent Compo- 
nents in an Electrocardiogram: As our second ap- 
plication we choose a case where Shannon theory is the 
proper setting. We show in Fig. 2 an ECG recorded from 
the abdomen and thorax of a pregnant woman It 
is already seen from this graph that there are at least 
two important components in this ECG: the heartbeat 
of the mother and of the fetus. In addition there is 
noise from various sources (muscle activity, measurement 
noise, etc.). While it is easy to detect anomalies in the 
mother's ECG from such a recording, it would be difficult 
to detect them in the fetal ECG. 

As a first approximation we can assume that the to- 
tal ECG is a linear superposition of several independent 
sources (mother, child, noisei, noise2,...). A standard 
method to disentangle such superpositions is indepen- 
dent component analysis (ICA) 12]. There, one tries 
to recover the sources by means of linear transformation 
Si(t) = Y^j=i WijXj(t), where Wij is determined by min- 
imizing the estimated MI between the Sj. 

In reality things are not so simple. For instance, the 
sources might not be independent, the number of sources 
(including noise sources!) might be different from the 
number of channels, and the mixing might involve de- 
lays. For the present case this implies that the heartbeat 
of the mother is seen in several reconstructed components 
Si, and that the "independent" components are not in- 
dependent at all. In particular, all components s$ which 
have large contributions from the mother form a clus- 
ter with large intra-cluster Mis and small inter-cluster 
Mis. The same is true for the fetal ECG, albeit less 
pronounced. To obtain clean recordings of the fetal and 
maternal ECGs, we proceeded as follows 01 . 

Since we expect different delays in the different chan- 
nels, we first used Takens delay embedding 0] with time 
delay 0.002 s and embedding dimension 3, resulting in 24 
channels. Wc then formed 24 linear combinations Si(t). 
We use the MI estimator |(|, for details see 0- Five of 
the resulting least dependent components contain strong 
contributions of the mother's heartbeat, three are domi- 
nated by the fetus. The rest contains mostly noise Q. 

In plotting the actual dendrogram (Fig. [3J) we used 
S(X, Y) for the cluster analysis but used the MI of 
the clusters to determine the height at which the two 
branches join. The MI of the first five channels, e.g., is 

1.44 nats, while that of channels 6 to 8 is « 0.3 nats. 
For any two clusters (tuples) X = X\ . . . X n and Y = 
Yi ... Y m one has I(X,Y) > I(X) + I(Y). This guar- 
antees, if the MI is estimated correctly, that the tree is 
drawn properly. The two slight glitches (when clusters 
(1 - 14) and (15 - 18) join, and when (21 - 22) is joined 
with 23) result from small errors in estimating MI. They 
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FIG. 2: ECG of a pregnant woman (sampling rate 500 Hz). 
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FIG. 3: Dendrogram for least dependent components. 

do in no way effect our conclusions. 

In Fig. |3 one can clearly see two big clusters corre- 
sponding to the mother and to the child. There are also 
some small clusters which should be considered as noise. 
For reconstructing the mother and child contributions to 
Fig. [2 we have to decide on one specific clustering from 
the entire hierarchy. We decided to make the cut at inter- 
cluster MI equal to 0.1, i.e. two clusters X and Y are 
joined whenever I((X), (Y)) = I{X, Y) - I{X) - I(Y) > 
0.1. Reconstructing the first five traces of the original 
ECG from the child components only, we obtain Fig. 0] 

In summary, we have shown that MI can not only be 
used as a proximity measure in clustering, but that it also 
suggests a conceptually very simple and natural hierar- 
chical clustering algorithm. We do not claim that this 
algorithm, called mutual information clustering (MIC), 
is always superior to other algorithms. Indeed, MI is in 
general not easy to estimate. Obviously, when only crude 
estimates are possible, also MIC will not give very good 
results. But as MI estimates are becoming better, also 
the results of MIC should improve. The present paper 
was partly triggered by the development of a new class 
of MI estimators for continuous random variables which 
have very small bias and also rather small variances [|J. 



FIG. 4: ECG where all contributions except those of the child 
cluster have been removed. 

We have illustrated our method with two applications, 
one from genetics and one from cardiology. For neither 
application MIC might give optimal clustering, but it 
seems promising that one common method gives decent 
results for both, although they are very different. 

The results of MIC should improve, if more data be- 
come available. This is trivial, if we mean by that longer 
time sequences in the application to ECG, and longer 
parts of the genome. It is less trivial that we expect MIC 
to make fewer mistakes in a phylogenetic tree, when more 
species are included. The reason is that close-by species 
will be correctly joined anyhow, and families - which 
now are represented only by single species and thus are 
poorly characterized - will be much better described by 
the concatenated genomes if more species are included. 

We would like to thank Arndt von Haesseler, Walter 
Nadler and Volker Roth for many useful discussions. 
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